The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this: it is a historic moment, scientifically and analytically, with a large amount of data being shared across the globe and analyzed in real time. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that, depending on the source, the number of cases conflicted. Originally I was using JHU data, but it was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: you clicked on the U.S. stats but it didn’t reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Tue Jun 2 10:16:31 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Tue Jun  2 10:16:31 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs case estimate for given log-linear model parameters slope and intercept
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column-name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  # model is linear in log10 space: log10(cases) = m*days + b
  total_cases.log<-m*days+b
  total_cases<-10^total_cases.log # back-transform to case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
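To make the back-transform concrete, here is a minimal, self-contained sketch of the log-linear prediction step: given slope m and intercept b for log10(cases) vs days, raise 10 to the fitted value to get case counts. The m, b, and day values below are hypothetical, chosen only for illustration.

```r
# Hypothetical log-linear parameters: log10(total cases) = m*days + b
m <- 0.1   # slope (log10 cases added per day) -- illustrative value
b <- 2     # intercept (log10 cases at day 0)  -- illustrative value
days <- c(0, 10, 20)

# Back-transform from log space to predicted case counts
predicted <- 10^(m * days + b)   # ~100, ~1000, ~10000
prediction <- data.frame(days = days, Total_confirmed_cases_perstate = predicted)
```

Leaving the result in log space would predict 4 "cases" on day 20 instead of 10,000, which is why the back-transform matters.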

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/value. Also enforces standardization/assumes data structure naming by using fixed variable name, value name, and id.vars.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                    value.name = "Total_confirmed_cases",
                    id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){
  long_data<-melt(data_in,
                  id.vars = id.vars,
                  variable.name=variable.name,
                  value.name=value.name)
  return(long_data)
}
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}
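As a quick sanity check of what name_overlaps reports, the base-R building blocks it wraps behave like this (df1/df2 here are throwaway illustrative frames, not the case data):

```r
df1 <- data.frame(a = 1, b = 2, c = 3)
df2 <- data.frame(b = 1, c = 2, d = 3)

shared   <- intersect(names(df1), names(df2))  # columns in both: "b" "c"
only_df1 <- setdiff(names(df1), names(df2))    # in df1 only: "a"
only_df2 <- setdiff(names(df2), names(df1))    # in df2 only: "d"
```

These are exactly the columns to reconcile before an rbind of the two frames will succeed.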

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count stable points in a trailing window (window = running_avg).
  # A point is "stable" when |second derivative| < 0.2
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[(t-running_avg+1):t,"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Back up running_avg days from the first fully stable window
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
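Stripped of the data-frame bookkeeping, the core idea of find_linear_index is: take the second difference of log2 cases and find the first window where it stays near zero. A self-contained sketch on synthetic data (all numbers invented for illustration; same 0.2 threshold and window of 5 as the function):

```r
# Erratic early growth followed by clean exponential growth (log-linear phase)
early <- c(1, 4, 6, 15, 20, 55, 70, 160)
late  <- 200 * 1.3^(0:14)
cases <- c(early, late)

l  <- log2(cases)
d2 <- diff(diff(l))            # second derivative of log2 cases
ok <- abs(d2) < 0.2            # "stable" points

# Trailing window sum: first index where 5 consecutive points are stable
run <- stats::filter(as.numeric(ok), rep(1, 5), sides = 1)
first_stable <- which(run == 5)[1]
```

During the clean exponential phase log2(cases) grows by a constant log2(1.3) per day, so the second difference is ~0 and the window fills up shortly after the erratic phase ends.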

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.1.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.1.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.1.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.1.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other datasets... Weird. When combining, all datasets will have this column, but US deaths is the only one where it is actually populated.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # only the US-level data has Cities; add a placeholder so the frames can be combined
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format. It's wide, with each column a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the world- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "") # read.csv prefixes date columns with X (e.g., X6.1.20); strip it rather than substituting "0", which would break months 10-12
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y") # %y expands the two-digit year to 2020
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
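To see what the date cleanup has to do on a single column name: read.csv turns a header like 6/1/20 into X6.1.20, so stripping the X and parsing with %m.%d.%y recovers the date (this sketch strips the prefix directly; the sample strings are illustrative):

```r
raw   <- c("X1.22.20", "X6.1.20", "X10.15.20")  # sample JHU-style column names
clean <- gsub("^X", "", raw)
dates <- as.Date(clean, format = "%m.%d.%y")    # %y expands "20" to 2020
format(dates)                                   # "2020-01-22" "2020-06-01" "2020-10-15"
numeric_date <- as.numeric(dates)               # days since 1970-01-01
```

The numeric form (18283 for 2020-01-22) is what Date.numeric holds and what the days-since-100 arithmetic later operates on.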

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
(Table condensed: every country has 132 death and 132 total datapoints, except those reported at the province level — Australia 1056, Canada 1848, China 4356, Denmark 396, France 1452, Netherlands 660, United Kingdom 1452 — and the combined US_state rows with 430452 each.)
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case occurred for each Country/Province. NOTE: non-US countries may have weird provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
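The same "days since 100th case" computation, in a self-contained base-R form on a toy frame (regions, days, and counts invented for illustration):

```r
toy <- data.frame(
  region = rep(c("A", "B"), each = 4),
  day    = rep(1:4, 2),
  cases  = c(50, 120, 300, 800,   90, 95, 150, 400)
)

# First day each region crossed 100 cases
first100 <- aggregate(day ~ region, data = subset(toy, cases > 100), FUN = min)
names(first100)[2] <- "case100_day"

# Align every observation to that per-region reference day
toy <- merge(toy, first100)
toy$days_since_100 <- toy$day - toy$case100_day
```

Days before the threshold come out negative, which is why the plots below start their x-axis at day 0.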

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 4669
Alaska 887
Arizona 1166
Arkansas 4905
California 4446
Colorado 4247
Connecticut 691
Delaware 287
Diamond Princess 77
District of Columbia 78
Florida 4974
Georgia 11137
Grand Princess 78
Guam 78
Hawaii 402
Idaho 2234
Illinois 6424
Indiana 6469
Iowa 5982
Kansas 4974
Kentucky 7186
Louisiana 4670
Maine 1186
Maryland 1817
Massachusetts 1169
Michigan 5662
Minnesota 5358
Mississippi 5791
Missouri 6451
Montana 2000
Nebraska 3756
Nevada 891
New Hampshire 818
New Jersey 1751
New Mexico 1972
New York 4439
North Carolina 6836
North Dakota 2344
Northern Mariana Islands 63
Ohio 6093
Oklahoma 4638
Oregon 2345
Pennsylvania 4802
Puerto Rico 78
Rhode Island 460
South Carolina 3375
South Dakota 2971
Tennessee 6592
Texas 13929
Utah 1071
Vermont 1095
Virgin Islands 78
Virginia 8632
Washington 3034
West Virginia 3172
Wisconsin 4634
Wyoming 1435
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimensions
comparative longitudinal case trend long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality_rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm to each region starting from the 100th case. m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on the same plot differ by ~2 logs, so when plotting log, use pch; when plotting absolute, need free scales
#when plotting death and case on the same plot, melt.

#CoronaCases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 44910
China 132
Diamond Princess 113
Korea, South 103
Japan 102
Italy 100
Iran 97
Singapore 94
France 93
Germany 93
Spain 92
US 91
Switzerland 89
United Kingdom 89
Belgium 88
Netherlands 88
Norway 88
Sweden 88
Austria 86
Malaysia 85
Australia 84
Bahrain 84
Denmark 84
Canada 83
Qatar 83
Iceland 82
Brazil 81
Czechia 81
Finland 81
Greece 81
Iraq 81
Israel 81
Portugal 81
Slovenia 81
Egypt 80
Estonia 80
India 80
Ireland 80
Kuwait 80
Philippines 80
Poland 80
Romania 80
Saudi Arabia 80
Indonesia 79
Lebanon 79
San Marino 79
Thailand 79
Chile 78
Pakistan 78
Luxembourg 77
Peru 77
Russia 77
Ecuador 76
Mexico 76
Slovakia 76
South Africa 76
United Arab Emirates 76
Armenia 75
Colombia 75
Croatia 75
Panama 75
Serbia 75
Taiwan* 75
Turkey 75
Argentina 74
Bulgaria 74
Latvia 74
Uruguay 74
Algeria 73
Costa Rica 73
Dominican Republic 73
Hungary 73
Andorra 72
Bosnia and Herzegovina 72
Jordan 72
Lithuania 72
Morocco 72
New Zealand 72
North Macedonia 72
Vietnam 72
Albania 71
Cyprus 71
Malta 71
Moldova 71
Brunei 70
Burkina Faso 70
Sri Lanka 70
Tunisia 70
Ukraine 69
Azerbaijan 68
Ghana 68
Kazakhstan 68
Oman 68
Senegal 68
Venezuela 68
Afghanistan 67
Cote d’Ivoire 67
Cuba 66
Mauritius 66
Uzbekistan 66
Cambodia 65
Cameroon 65
Honduras 65
Nigeria 65
West Bank and Gaza 65
Belarus 64
Georgia 64
Bolivia 63
Kosovo 63
Kyrgyzstan 63
Montenegro 63
Congo (Kinshasa) 62
Kenya 61
Niger 60
Guinea 59
Rwanda 59
Trinidad and Tobago 59
Paraguay 58
Bangladesh 57
Djibouti 55
El Salvador 54
Guatemala 53
Madagascar 52
Mali 51
Congo (Brazzaville) 48
Jamaica 48
Gabon 46
Somalia 46
Tanzania 46
Ethiopia 45
Burma 44
Sudan 43
Liberia 42
Maldives 40
Equatorial Guinea 39
Cabo Verde 37
Sierra Leone 35
Guinea-Bissau 34
Togo 34
Zambia 33
Eswatini 32
Chad 31
Tajikistan 30
Haiti 28
Sao Tome and Principe 28
Benin 26
Nepal 26
Uganda 26
Central African Republic 25
South Sudan 25
Guyana 23
Mozambique 22
Yemen 18
Mongolia 17
Mauritania 14
Nicaragua 14
Malawi 8
Syria 8
Zimbabwe 6
Bahamas 5
Libya 5
Comoros 3
# Pick the top countries with data (capped at the palette size)
max_colors<-12
# find a way to fix this: China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 132
Korea, South 103
Japan 102
Italy 100
Iran 97
Singapore 94
France 93
Germany 93
Spain 92
US 91
Switzerland 89
United Kingdom 89
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-04-04        18356
## 2    Afghanistan           <NA> <NA> 2020-02-05        18297
## 3    Afghanistan           <NA> <NA> 2020-05-08        18390
## 4    Afghanistan           <NA> <NA> 2020-05-10        18392
## 5    Afghanistan           <NA> <NA> 2020-01-22        18283
## 6    Afghanistan           <NA> <NA> 2020-04-02        18354
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      7                   299     0.02341137
## 2                      0                     0            NaN
## 3                    109                  3778     0.02885124
## 4                    120                  4402     0.02726034
## 5                      0                     0            NaN
## 6                      6                   273     0.02197802
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  2.475671                  0.8450980        18348
## 2                      -Inf                       -Inf        18348
## 3                  3.577262                  2.0374265        18348
## 4                  3.643650                  2.0791812        18348
## 5                      -Inf                       -Inf        18348
## 6                  2.436163                  0.7781513        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1              8  NA   NA         NA                           NA
## 2            -51  NA   NA         NA                           NA
## 3             42  NA   NA         NA                           NA
## 4             44  NA   NA         NA                           NA
## 5            -65  NA   NA         NA                           NA
## 6              6  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)
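The commented-out model above boils down to: fit lm on log10 cases vs days since the 100th case, then read the doubling time off the slope. A minimal sketch on synthetic data (the growth rate is invented; log10(2)/slope converts the fitted slope to a doubling time):

```r
# Synthetic cases that double exactly every 3 days
days  <- 0:20
cases <- 100 * 2^(days / 3)

fit       <- lm(log10(cases) ~ days)
slope     <- unname(coef(fit)[2])   # log10 cases added per day
intercept <- unname(coef(fit)[1])   # log10 cases at day 0

doubling_time <- log10(2) / slope   # recovers ~3 days
r <- cor(days, log10(cases))        # exactly log-linear => r ~ 1
```

Applied per region, the (slope, intercept) pairs are exactly the (m, b) inputs that prediction_model expects.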

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
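write_plot is a helper defined earlier in this markdown. Judging from the echoed paths above, a minimal base-R sketch of such a helper (hypothetical; the real function may use different dimensions or ggsave) could be:

```r
# Hypothetical sketch of a write_plot-style helper: save the object under its own
# name as <wd>/<name>.png and return the full path (as echoed above).
write_plot_sketch <- function(plot_obj, wd, width = 1600, height = 1200) {
  out <- file.path(wd, paste0(deparse(substitute(plot_obj)), ".png"))
  png(out, width = width, height = height)
  print(plot_obj)   # printing a ggplot object renders it to the open device
  dev.off()
  out
}
```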
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs date becomes linear in each plot
  2. Fit a linear model for each state
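The first step relies on a helper, find_linear_index, defined earlier in this markdown. A minimal sketch of the idea — find the first date where the discrete second difference of the case counts is near zero, i.e. growth has stopped accelerating — might look like the following (the tolerance and tie-breaking are assumptions):

```r
# Hypothetical sketch of a find_linear_index-style helper: return the numeric
# date at which cases-vs-time stops accelerating (second difference ~ 0).
find_linear_index_sketch <- function(df, tol = 50) {
  df <- df[order(df$Date), ]
  d2 <- diff(df$Total_confirmed_cases_perstate, differences = 2)
  idx <- which(abs(d2) < tol)[1]   # first day the curve is roughly linear
  if (is.na(idx)) idx <- 1         # fall back to the start of the series
  as.numeric(df$Date[idx])
}
```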

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the state list: swap New York out in favor of Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is approximately 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state with fitted linear model, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

# Predict for today, the next 3 days, and one week out, anchored at a fixed date
predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Accumulate per-state predictions; rbind() onto NULL avoids a spurious all-NA first row
predicted_days_df<-NULL

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")
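prediction_model is defined earlier in this markdown; for reference, a minimal sketch consistent with how it is called in the loop above (linear extrapolation y = m*day + b over numeric dates) might be:

```r
# Hypothetical sketch of prediction_model: extrapolate the fitted line to a
# vector of numeric dates, returning the columns the loop above expects.
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```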

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the predicted trajectory of COVID-19 based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable? What about delays?
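Some of these questions fall directly out of the fitted log-linear slope: if log10(cases) = m*t + b, then the doubling time is log10(2)/m, and the time to reach a target count follows the same way. A small illustration (hypothetical helper names):

```r
# Under log-linear growth, log10(cases) = m*t + b, so cases double every
# log10(2)/m days
doubling_time <- function(m) log10(2) / m

# Days until cases_now grows to target, at slope m (log10 cases per day)
days_until <- function(m, cases_now, target) (log10(target) - log10(cases_now)) / m
```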

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

# note: slope and intercept come from the US model fit (commented out above)
historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Toy data to validate per-group fitting: state A follows y=x+5, state B follows y=5x+1
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

# Fit one linear model per group; return intercept (b) and slope (m) for each group
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # ldply(..., coef) yields the grouping column, "(Intercept)", then the predictor;
  # rename the latter two to b and m
  names(case_models.parameters)[names(case_models.parameters)=="(Intercept)"]<-"b"
  names(case_models.parameters)[3]<-"m"
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

# NOTE: the plot below depends on historical_model_predictions.withHx, so the
# commented-out blocks above must be run first
(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for: * Retail & recreation
* Grocery & pharmacy
* Parks
* Transit
* Workplace
* Residential

Data from https://www.google.com/covid19/mobility/
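The scraped values arrive as percent strings (e.g. "-63%"), so they must be stripped of the "%" sign and coerced to numeric before plotting; a minimal helper equivalent to the gsub/as.numeric cleanup applied to plot_data below:

```r
# Convert Google mobility percent strings (e.g. "-63%") to numeric values
pct_to_num <- function(x) as.numeric(gsub("%", "", x))
```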

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit -1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Retail_Recreation 0.9970242 -56.0
Hawaii Grocery_Pharmacy 0.9784683 -34.0
New Hampshire Parks 0.9573557 -20.0
Maine Transit -0.9121316 -50.0
Connecticut Grocery_Pharmacy -0.8928633 -6.0
Utah Residential -0.8866356 12.0
Vermont Parks 0.8568199 -35.5
South Dakota Parks 0.8501224 -26.0
Alaska Residential 0.8223051 13.0
Utah Transit -0.8023751 -18.0
Wyoming Parks -0.7897688 -4.0
Alaska Grocery_Pharmacy -0.7778882 -7.0
Hawaii Residential -0.7604103 19.0
Massachusetts Workplace -0.7576328 -39.0
Rhode Island Workplace -0.7539745 -39.5
Connecticut Transit -0.7524861 -50.0
Alaska Workplace -0.7210146 -33.0
Wyoming Transit -0.7137323 -17.0
Hawaii Parks 0.7097505 -72.0
Arizona Grocery_Pharmacy -0.6813940 -15.0
Utah Parks -0.6621592 17.0
Vermont Grocery_Pharmacy -0.6501176 -25.0
Hawaii Transit 0.6494429 -89.0
New York Workplace -0.6489119 -34.5
Maine Workplace -0.6422595 -30.0
Rhode Island Retail_Recreation -0.6296900 -45.0
Utah Workplace -0.6206909 -37.0
Rhode Island Residential -0.6121146 18.5
New Jersey Parks -0.6012704 -6.0
New York Retail_Recreation -0.5875938 -46.0
Nebraska Workplace 0.5844619 -32.0
New Jersey Workplace -0.5712162 -44.0
North Dakota Parks 0.5601724 -34.0
Arizona Retail_Recreation -0.5388995 -42.5
New York Parks 0.5280074 20.0
Connecticut Residential 0.5213770 14.0
Maine Parks 0.5109222 -31.0
Connecticut Retail_Recreation -0.5103996 -45.0
Massachusetts Retail_Recreation -0.5073780 -44.0
Hawaii Workplace 0.5059401 -46.0
North Dakota Retail_Recreation -0.5042321 -42.0
Montana Parks -0.4833986 -58.0
New Jersey Grocery_Pharmacy -0.4811869 2.5
Connecticut Workplace -0.4809914 -39.0
Arizona Residential 0.4803346 13.0
Nebraska Residential -0.4779114 14.0
New Mexico Grocery_Pharmacy -0.4772195 -11.0
Kansas Parks 0.4753706 72.0
New Jersey Retail_Recreation -0.4714788 -62.5
Montana Residential 0.4692623 14.0
Rhode Island Parks 0.4677024 52.0
Iowa Parks -0.4652742 28.5
West Virginia Parks 0.4624648 -33.0
Wyoming Workplace -0.4500780 -31.0
New Mexico Residential 0.4481430 13.5
Nevada Transit -0.4448766 -20.0
New Mexico Parks 0.4408143 -31.5
Illinois Transit -0.4387005 -31.0
Pennsylvania Workplace -0.4310680 -36.0
Vermont Residential 0.4269771 11.5
South Carolina Workplace 0.4258953 -30.0
New Jersey Transit -0.4190853 -50.5
Maryland Workplace -0.4162764 -35.0
Massachusetts Grocery_Pharmacy -0.4145629 -7.0
Kentucky Parks -0.4087646 28.5
Arizona Transit 0.4065820 -38.0
Alabama Grocery_Pharmacy -0.4054484 -2.0
New Hampshire Residential -0.4052968 14.0
Idaho Workplace -0.3886576 -29.0
California Transit -0.3877004 -42.0
Idaho Grocery_Pharmacy -0.3847052 -5.5
Maryland Grocery_Pharmacy -0.3843552 -10.0
Pennsylvania Retail_Recreation -0.3772148 -45.0
Wisconsin Transit -0.3768401 -23.5
Alabama Workplace -0.3759370 -29.0
New York Transit -0.3719292 -48.0
Idaho Residential -0.3700789 11.0
Idaho Transit -0.3647859 -30.0
California Residential 0.3595870 14.0
Michigan Parks 0.3524173 30.0
New Mexico Retail_Recreation -0.3503954 -42.5
Wyoming Grocery_Pharmacy -0.3388058 -10.0
Nebraska Grocery_Pharmacy 0.3387946 -0.5
North Carolina Grocery_Pharmacy 0.3261778 0.0
Alaska Retail_Recreation 0.3229545 -39.0
Maine Retail_Recreation -0.3167500 -42.0
California Parks -0.3107453 -38.5
Minnesota Transit -0.3069517 -28.5
Virginia Transit -0.3035804 -33.0
North Dakota Workplace 0.3035238 -40.0
Vermont Retail_Recreation 0.3025173 -57.0
Florida Residential 0.3004484 14.0
Montana Workplace -0.2986192 -40.5
Maryland Retail_Recreation -0.2981025 -39.0
Alabama Transit -0.2977630 -36.5
Pennsylvania Parks 0.2914738 12.0
Mississippi Residential 0.2885872 13.0
North Carolina Workplace 0.2875498 -31.0
Arkansas Retail_Recreation -0.2849503 -30.0
Colorado Residential 0.2838787 14.0
Illinois Workplace -0.2820911 -30.5
Nevada Residential 0.2782004 17.0
Texas Residential -0.2722386 15.0
Rhode Island Transit -0.2692428 -56.0
Idaho Retail_Recreation -0.2690135 -39.5
Oregon Grocery_Pharmacy 0.2666934 -7.0
Maryland Residential 0.2595653 15.0
Vermont Workplace -0.2581423 -43.0
Georgia Grocery_Pharmacy -0.2570034 -10.0
Rhode Island Grocery_Pharmacy 0.2569036 -7.5
Texas Workplace 0.2562148 -32.0
Kansas Workplace 0.2529042 -32.5
Tennessee Workplace -0.2506369 -31.0
Illinois Parks 0.2498608 26.5
West Virginia Grocery_Pharmacy -0.2486295 -6.0
Missouri Residential -0.2482962 13.0
Arkansas Residential 0.2436619 12.0
Nevada Retail_Recreation -0.2407879 -43.0
Pennsylvania Grocery_Pharmacy -0.2385445 -6.0
Florida Parks -0.2369368 -43.0
Tennessee Residential 0.2354791 11.5
Utah Retail_Recreation -0.2313767 -40.0
New York Grocery_Pharmacy -0.2312840 8.0
Wisconsin Parks 0.2307834 51.5
South Carolina Parks -0.2307128 -23.0
Michigan Workplace -0.2126280 -40.0
North Carolina Transit 0.2110670 -32.0
Georgia Workplace -0.2096819 -33.5
Illinois Residential 0.2051349 14.0
Montana Transit 0.2050005 -41.0
Texas Parks 0.2021709 -42.0
West Virginia Workplace 0.2001197 -33.0
Kansas Grocery_Pharmacy -0.1995694 -14.0
Mississippi Grocery_Pharmacy -0.1985146 -8.0
North Carolina Residential 0.1968322 13.0
Washington Workplace -0.1929957 -38.0
Georgia Retail_Recreation -0.1896178 -41.0
Colorado Parks -0.1883144 2.0
California Grocery_Pharmacy -0.1859970 -11.5
New Jersey Residential 0.1856709 18.0
North Dakota Grocery_Pharmacy -0.1831079 -8.0
California Retail_Recreation -0.1811824 -44.0
Virginia Residential 0.1787704 14.0
Minnesota Parks 0.1772339 -9.0
Iowa Transit 0.1743987 -24.0
Connecticut Parks 0.1724763 43.0
Wisconsin Residential -0.1702238 14.0
Missouri Workplace 0.1686437 -28.0
Florida Retail_Recreation 0.1671878 -43.0
Washington Grocery_Pharmacy 0.1659547 -7.0
New Mexico Transit 0.1653675 -38.5
New Hampshire Retail_Recreation -0.1628075 -41.0
Massachusetts Parks 0.1604369 39.0
Ohio Transit 0.1595887 -28.0
Pennsylvania Transit -0.1588782 -42.0
Kentucky Grocery_Pharmacy 0.1572571 4.0
South Dakota Transit -0.1530477 -40.0
Georgia Residential -0.1520126 13.0
Indiana Retail_Recreation 0.1518894 -38.0
Virginia Grocery_Pharmacy -0.1518601 -8.0
Indiana Residential 0.1505329 12.0
California Workplace -0.1456911 -36.0
Oregon Transit 0.1442180 -27.5
Mississippi Retail_Recreation -0.1413135 -40.0
Alabama Parks 0.1406385 -1.0
Wisconsin Workplace -0.1382043 -31.0
Washington Residential 0.1349448 13.0
Kentucky Transit -0.1346924 -31.0
South Dakota Workplace 0.1343200 -35.0
North Dakota Transit 0.1339252 -48.0
Virginia Parks 0.1326653 6.0
Oregon Retail_Recreation 0.1323390 -40.5
Oklahoma Residential 0.1301057 15.0
South Dakota Residential 0.1299771 15.0
Alabama Retail_Recreation 0.1262967 -39.0
Nebraska Transit -0.1249976 -9.0
Michigan Retail_Recreation -0.1226299 -53.0
Mississippi Transit -0.1225904 -38.5
Minnesota Retail_Recreation 0.1177209 -40.0
Massachusetts Transit -0.1174559 -45.0
New Hampshire Grocery_Pharmacy -0.1171144 -6.0
Texas Grocery_Pharmacy 0.1144016 -14.0
Ohio Parks -0.1090305 67.5
Wyoming Residential 0.1069266 12.5
Texas Transit 0.1062174 -41.0
Nebraska Retail_Recreation 0.1055632 -36.0
Kansas Transit -0.1049199 -26.5
Massachusetts Residential 0.1044116 15.0
Wisconsin Grocery_Pharmacy 0.1043014 -1.0
Idaho Parks 0.1039773 -22.0
Arkansas Workplace -0.1036591 -26.0
Maryland Transit -0.1018871 -39.0
Indiana Parks -0.1002475 29.0
Ohio Residential 0.0997056 14.0
Minnesota Workplace -0.0988423 -33.0
Washington Transit -0.0969411 -33.5
Oregon Workplace -0.0967532 -31.0
Nevada Parks 0.0966512 -12.5
Oklahoma Grocery_Pharmacy -0.0958059 -0.5
South Dakota Retail_Recreation -0.0957820 -38.5
North Carolina Parks -0.0954447 7.0
Oklahoma Parks -0.0948622 -18.5
South Carolina Residential -0.0943120 12.0
Oregon Residential 0.0938647 10.5
New York Residential 0.0932214 17.5
Florida Workplace -0.0915492 -33.0
Wyoming Retail_Recreation -0.0915302 -39.0
Arkansas Parks 0.0900976 -12.0
Arkansas Transit 0.0898321 -27.0
Maine Residential -0.0885378 11.0
Missouri Transit -0.0868292 -24.5
North Dakota Residential -0.0865029 17.0
Indiana Workplace 0.0862214 -34.0
Arizona Parks -0.0860539 -44.5
Georgia Parks 0.0858745 -6.0
Kentucky Residential 0.0858743 12.0
Pennsylvania Residential 0.0856287 15.0
Arizona Workplace -0.0840614 -35.0
Virginia Workplace -0.0832561 -31.5
Oregon Parks 0.0812419 16.5
Kentucky Retail_Recreation 0.0805094 -29.0
Tennessee Parks -0.0789573 10.5
Michigan Residential 0.0786025 15.0
Virginia Retail_Recreation -0.0780895 -35.0
North Carolina Retail_Recreation 0.0768794 -34.0
Ohio Grocery_Pharmacy 0.0767693 0.0
West Virginia Residential -0.0757364 11.0
Indiana Grocery_Pharmacy -0.0713127 -5.5
Colorado Transit 0.0707340 -36.0
Michigan Grocery_Pharmacy -0.0705139 -11.0
South Carolina Transit 0.0696270 -45.0
Michigan Transit 0.0689687 -46.0
Minnesota Grocery_Pharmacy 0.0677111 -6.0
West Virginia Transit -0.0665876 -45.0
Mississippi Workplace -0.0663252 -33.0
Montana Retail_Recreation -0.0657087 -51.0
Oklahoma Workplace 0.0638350 -31.0
Minnesota Residential -0.0631781 17.0
Montana Grocery_Pharmacy -0.0624158 -16.0
Florida Grocery_Pharmacy 0.0607604 -14.0
Maine Grocery_Pharmacy -0.0601872 -13.0
Washington Parks 0.0590711 -3.5
Ohio Retail_Recreation 0.0583549 -36.0
Mississippi Parks -0.0576052 -25.0
Texas Retail_Recreation 0.0557888 -40.0
Missouri Parks 0.0551503 0.0
Iowa Grocery_Pharmacy -0.0518764 4.0
South Carolina Grocery_Pharmacy 0.0513576 1.0
Utah Grocery_Pharmacy 0.0510581 -4.0
Kentucky Workplace -0.0486646 -36.0
Illinois Grocery_Pharmacy -0.0470340 2.0
Iowa Workplace 0.0470287 -30.0
Alabama Residential -0.0437721 11.0
Florida Transit -0.0395526 -49.0
New Hampshire Workplace 0.0384374 -37.0
Indiana Transit 0.0382258 -29.0
South Carolina Retail_Recreation -0.0359170 -35.0
Washington Retail_Recreation -0.0352852 -42.0
Ohio Workplace -0.0325889 -35.0
Missouri Retail_Recreation -0.0311482 -36.5
Colorado Grocery_Pharmacy -0.0302632 -17.0
New Hampshire Transit -0.0263309 -57.0
South Dakota Grocery_Pharmacy 0.0244719 -9.0
Tennessee Grocery_Pharmacy 0.0239289 6.0
Illinois Retail_Recreation 0.0236308 -40.0
Colorado Retail_Recreation -0.0231352 -44.0
New Mexico Workplace 0.0194282 -34.0
Kansas Residential -0.0164430 13.0
West Virginia Retail_Recreation -0.0162670 -38.5
Missouri Grocery_Pharmacy 0.0144908 2.0
Kansas Retail_Recreation -0.0138992 -37.0
Wisconsin Retail_Recreation 0.0137810 -44.0
Oklahoma Retail_Recreation 0.0119811 -31.0
Vermont Transit -0.0116465 -63.0
Iowa Retail_Recreation -0.0115945 -38.0
Iowa Residential -0.0111754 13.0
Colorado Workplace 0.0105232 -39.0
Georgia Transit -0.0101039 -35.0
Oklahoma Transit 0.0083523 -26.0
Nevada Grocery_Pharmacy 0.0082119 -12.5
Tennessee Retail_Recreation -0.0080096 -30.0
Arkansas Grocery_Pharmacy -0.0071453 3.0
Nevada Workplace 0.0067490 -40.0
Tennessee Transit -0.0052760 -32.0
Maryland Parks -0.0022548 27.0
Nebraska Parks 0.0009274 55.5
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
Name | Handle | Link
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still log-linear). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with the exact value varying by region. Mobility is not a strong indicator of caseload (U.S. data).
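One way to check whether a region is still in the log-linear phase is to regress log cases on the day index and inspect the fit. Below is a minimal sketch in base R using synthetic, roughly exponential data; with real data you would substitute one region's cumulative case column (the column names used elsewhere in this analysis, e.g. Total_confirmed_cases, are not wired in here).

```r
# Sketch: detect log-linear growth by fitting log10(cases) ~ day.
# Synthetic illustrative data, NOT the analysis data set.
day   <- 1:30
cases <- round(10 * 1.2^day)       # ~20% daily growth

fit <- lm(log10(cases) ~ day)      # log-linear model
summary(fit)$r.squared             # close to 1 => still log-linear

# Doubling time (days) implied by the fitted slope:
log10(2) / coef(fit)[["day"]]
```

An R-squared near 1 indicates the region is still on the exponential (log-linear) trajectory; a systematic bend in the residuals signals the tailing off described above.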

See table below for detailed breakdown.

Question: What is the effect of social distancing and decreased mobility on case load?
Answer: There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (residential, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California shows one of the stronger correlations, but results are mixed.

Question: What is the trend in cases and mortality across geographical regions?
Answer: Confirmed total cases and mortality are overall log-linear for most countries, with a tailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to tail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region but is about 3% overall.
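The mortality rate quoted above is a crude case-fatality ratio: cumulative deaths divided by cumulative confirmed cases for a region. A minimal sketch with illustrative numbers (the column names here are placeholders modeled on those used in this analysis, not the actual data set):

```r
# Sketch: crude case-fatality (mortality) rate per region.
# Illustrative values only; real data would come from the source CSVs.
region <- data.frame(
  Total_confirmed_cases = c(1000, 5000, 20000),
  Total_deaths          = c(30, 150, 610)
)

region$mortality <- region$Total_deaths / region$Total_confirmed_cases
round(100 * region$mortality, 2)   # rate in percent, one value per region
```

Note this ratio depends heavily on testing volume (the denominator counts only RNA-confirmed cases), which is one reason it fluctuates by region and over time.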

END

End: ##—— Tue Jun 2 10:17:53 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")   # one-time install
library(maps)
library(ggplot2)
library(dplyr)
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties))  # dummy case counts, one per row
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_cast(spdf)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net